The FIS team and I have begun to examine the role of transformative investments in the District. Our first pass at this analysis was presented at the 2013 Fall NTA Meeting, and we need to clean up a few items for a proceedings submission. This script provides exploratory graphics to capture economic context over the study period.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import pysal as ps
import geopandas as gp
import matplotlib.pyplot as plt
import pylab
from IPython.display import HTML
import seaborn as sb
from pandas.tools.plotting import scatter_matrix
import statsmodels.api as sm
from mpltools import color
import scipy
#Set print width
pd.set_option('display.width',100)
The data for this exercise will come from the annual summary stats by neighborhood compiled in edev_mean.csv and edev_median.csv, located in '/home/choct155/ORA/EconDev'.
First, let's read in the data.
#Establish working directory
workdir='/home/choct155/ORA/EconDev/'
#Read in data
mean_temp=pd.read_csv(workdir+'edev_mean.csv',index_col=['aneigh','year']).sortlevel(0)
med_temp=pd.read_csv(workdir+'edev_median.csv',index_col=['aneigh','year']).sortlevel(0)
#Slice to exclude empty neighborhoods
mean=mean_temp.ix[('16th Street Heights',2001):('Woodridge',2011)]
med=med_temp.ix[('16th Street Heights',2001):('Woodridge',2011)]
#Combine sets
stats=mean.join(med[['50%_iit','50%_prop']])
stats.head()
The above data must be augmented with spatial references. We can grab said references using the assessment neighborhood shapefile.
#Read in spatial data
nbhd_temp=gp.GeoDataFrame.from_file(workdir+'as_nbhd.shp').rename(columns={'NEIGHBORHO':'aneigh'})
#Set neighborhoods as row names
nbhd=nbhd_temp.set_index('aneigh')
nbhd.head()
Now we need to merge this information together. To make this happen, we will need to split the economic data by year, and merge it with the spatial information. All told, we will create 11 GeoDataFrames. Since we are walking before running here, let's test the 2001 case.
#Join 2001 economic data with spatial references
nbhd01=gp.GeoDataFrame(nbhd.join(stats.xs(2001,level='year')))
nbhd01.head()
Does it plot?
nbhd01.plot(column='mean_iit', scheme='QUANTILES', k=3, colormap='OrRd')
Doesn't look quite right. I am wondering if it's an issue with GeoPandas or the data. Do pre-merge values plot appropriately?
nbhd.plot(column='NBHD_NUM', scheme='QUANTILES', k=3, colormap='OrRd')
They appear to do so. What do the underlying income data look like in 2001?
#Plot distribution
fig,axes=plt.subplots(figsize=(16,6))
#Plot average AGI distribution
sb.kdeplot(nbhd01['mean_iit'][nbhd01['mean_iit'].notnull()],ax=axes,shade=True)
sb.rugplot(nbhd01['mean_iit'][nbhd01['mean_iit'].notnull()],ax=axes,color='#999999')
#Set title
axes.set_title('Average AGI by Assessment Neighborhood (2001)',fontsize=16)
#Set background color
axes.patch.set_facecolor('white')
The plot above captures the kernel density estimate of the mean_iit distribution. Nothing looks all that strange. The spacing looks a little funny because of the relatively low number of observations (maximum of 70 per year).
Perhaps it has to do with the number of discrete categories. Let's change that up.
plt.rcParams['figure.figsize']=18,20
fig,axes=plt.subplots(2,2,sharex=True,sharey=True)
nbhd01.plot(column='mean_iit', scheme='QUANTILES', k=3, colormap='OrRd',axes=axes[0,0])
axes[0,0].set_title('k=3')
nbhd01.plot(column='mean_iit', scheme='QUANTILES', k=4, colormap='OrRd',axes=axes[0,1])
axes[0,1].set_title('k=4')
nbhd01.plot(column='mean_iit', scheme='QUANTILES', k=5, colormap='OrRd',axes=axes[1,0])
axes[1,0].set_title('k=5')
nbhd01.plot(column='mean_iit', scheme='QUANTILES', k=6, colormap='OrRd',axes=axes[1,1])
axes[1,1].set_title('k=6')
plt.suptitle('Distribution of mean_iit by Quantile Bin Count',fontsize=20)
I believe we have identified the source of the problem. We should be specifying quintiles with k=5. It would also be useful to see what the different classification schemes offer. In addition to 'QUANTILES', classification rules such as 'fisher_jenks' and 'equal_interval' can be employed:
#Set plot size
plt.rcParams['figure.figsize']=18,9
#Create plotting object
fig,axes=plt.subplots(1,2,sharex=True,sharey=True)
#Plot Fisher-Jenks and change bg
nbhd01.plot(column='mean_iit', scheme='fisher_jenks', k=8, colormap='OrRd',axes=axes[0])
axes[0].set_title('Fisher-Jenks')
axes[0].patch.set_facecolor('white')
#Plot Equal Interval and change bg
nbhd01.plot(column='mean_iit', scheme='equal_interval', k=8, colormap='OrRd',axes=axes[1])
axes[1].set_title('Equal Interval')
axes[1].patch.set_facecolor('white')
It looks like extreme values on the high end are probably disrupting things. For Fisher-Jenks, the scheme seeks to minimize variance within groups while maximizing variance across groups. I suspect too many observations on the low end fall into a single category in an effort to accommodate the very high variance among the highest values. For Equal Interval, the gap between the highest values and more typical values is probably consuming multiple groups. The Quantile approach looks like a winner.
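To see why the Quantile scheme holds up under a heavy right tail, here is a small illustration with a hypothetical skewed sample (the values are invented for demonstration, not drawn from the neighborhood data):

```python
import numpy as np

#Hypothetical right-skewed sample standing in for mean_iit
vals = np.array([20, 25, 30, 35, 40, 45, 50, 60, 500, 900], dtype=float)

#Quantile breaks: equal counts per bin, so the low end stays differentiated
quantile_breaks = np.percentile(vals, [20, 40, 60, 80])

#Equal-interval breaks: equal widths, so the outliers consume most bins
lo, hi = vals.min(), vals.max()
equal_breaks = lo + (hi - lo) * np.array([.2, .4, .6, .8])

#Most observations land below the first equal-interval break...
print((vals < equal_breaks[0]).sum())                   # 8 of 10
#...while each quantile bin picks up two observations by construction
print([(vals <= b).sum() for b in quantile_breaks])     # [2, 4, 6, 8]
```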
Now we have an approach that can be used to examine these distribution plots over time. The first order of business is generating GeoDataFrames for each year.
#Create container for GDFs
nbhd_gdf_list=[]
#Create container for years
yr_list=[]
#For each year in 2001-2011...
for yr in range(2001,2012):
    #...grab the year...
    yr_list.append(yr)
    #...and the newly merged GDF
    nbhd_gdf_list.append(gp.GeoDataFrame(nbhd.join(stats.xs(yr,level='year'))))
#Capture all GDFs in a dictionary
nbhd_dict=dict(zip(yr_list,nbhd_gdf_list))
nbhd_dict[2001].head()
Now we can plot the quintile distribution over time. It would be useful to directly compare the income (adjusted gross income) and property (assessed value) data. It's difficult to see annual trends, so we will subset to three years: [2001,2006,2011]. It would also be useful to capture the difference over the 2001-2011 period in one plot.
#Calculate difference across the time period
dstats=stats.xs(2011,level='year')-stats.xs(2001,level='year')
#Create new GDF
nbhd_diff=gp.GeoDataFrame(nbhd.join(dstats))
nbhd_diff
With the difference calculated, we can move ahead to plotting.
#Check which 2006 neighborhoods lack income data
nbhd_dict[2006][nbhd_dict[2006]['mean_iit'].isnull()]['mean_iit']
#Set plot size
plt.rcParams['figure.figsize']=18,32
#Update nbhd_dict to include the difference information
nbhd_dict['diff']=nbhd_diff
#Create list of year subset
yr_sub=[2001,2006,2011,'diff']
#Create plotting object
fig,axes=plt.subplots(len(yr_sub),2)
#For each year in the subset (plus the overall difference)...
for i in range(len(yr_sub)):
    #...plot the distribution of mean_iit...
    nbhd_dict[yr_sub[i]].plot(column='mean_iit', scheme='QUANTILES', k=5, colormap='Blues',axes=axes[i,0])
    axes[i,0].set_title('Average AGI in '+str(yr_sub[i]))
    axes[i,0].patch.set_facecolor('white')
    axes[i,0].set_aspect(1)
    axes[i,0].set_axis_off()
    #...and the distribution of mean_prop
    nbhd_dict[yr_sub[i]].plot(column='mean_prop', scheme='QUANTILES', k=5, colormap='Reds',axes=axes[i,1])
    axes[i,1].set_title('Average Assessment Value in '+str(yr_sub[i]))
    axes[i,1].patch.set_facecolor('white')
    axes[i,1].set_aspect(1)
    axes[i,1].set_axis_off()
#Remove unused space
plt.tight_layout()
#Adjust titles on last two plots
axes[3,0].set_title('Average AGI Difference (2001-2011)')
axes[3,1].set_title('Average Assessment Value Difference (2001-2011)')
#Set image location on disk
img_dir='/home/choct155/ORA/EconDev/'
plt.savefig(img_dir+'AvgTaxBase_nbhd.png')
The general eastward trend is expected. It is interesting to note, however, that most of the income growth has been captured in the middle of the city, while property growth has been greater in the more residential areas of Ward 5 and surrounding areas.
What about the median values?
#Create plotting object
fig,axes=plt.subplots(len(yr_sub),2)
#For each year in the subset (plus the overall difference)...
for i in range(len(yr_sub)):
    #...plot the distribution of median AGI...
    nbhd_dict[yr_sub[i]].plot(column='50%_iit', scheme='QUANTILES', k=5, colormap='Blues',axes=axes[i,0])
    axes[i,0].set_title('Median AGI in '+str(yr_sub[i]))
    axes[i,0].patch.set_facecolor('white')
    axes[i,0].set_aspect(1)
    axes[i,0].set_axis_off()
    #...and the distribution of median assessment value
    nbhd_dict[yr_sub[i]].plot(column='50%_prop', scheme='QUANTILES', k=5, colormap='Reds',axes=axes[i,1])
    axes[i,1].set_title('Median Assessment Value in '+str(yr_sub[i]))
    axes[i,1].patch.set_facecolor('white')
    axes[i,1].set_aspect(1)
    axes[i,1].set_axis_off()
#Remove unused space
plt.tight_layout()
#Adjust titles on last two plots
axes[3,0].set_title('Median AGI Difference (2001-2011)')
axes[3,1].set_title('Median Assessment Value Difference (2001-2011)')
plt.savefig(img_dir+'MedianTaxBase_nbhd.png')
While average assessment values have grown in the 'Near Northeast', the median values do not display the same action. Median values did have noticeable growth along the NW/NE border, which reinforced the eastward shift. On the other hand, median income values cluster in the center of the city just like mean values.
What about business license activity? The autoregressive nature of this figure is reduced because it captures license growth (unlike the stock characteristics captured by adjusted gross income and assessment value). Consequently, we will look at each year over the 2001-2011 period.
#Set plot size
plt.rcParams['figure.figsize']=18,25
#Create plotting object
fig,axes=plt.subplots(4,3)
#For each year in 2001-2011...
for i in range(len(yr_list)):
    #...plot the distribution of num_bbl...
    nbhd_dict[yr_list[i]].plot(column='num_bbl', scheme='QUANTILES', k=5, colormap='Purples',axes=axes[i//3,i%3],alpha=.5)
    #...set face color to white and...
    axes[i//3,i%3].patch.set_facecolor('white')
    #...set titles...
    axes[i//3,i%3].set_title(str(yr_list[i]))
    #...turn off those pesky axes...
    axes[i//3,i%3].set_axis_off()
    #...and set the aspect ratio
    axes[i//3,i%3].set_aspect(1)
#Set super title
plt.suptitle('Distribution of Basic Business Licenses',fontsize=22)
plt.subplots_adjust(top=1.85)
plt.tight_layout()
plt.axis('off')
plt.savefig(img_dir+'BBL_nbhd.png')
Interestingly, the patterns for business development do not appear to shift much over time at first blush.
Which areas have received the greatest amount of investment?
#Sum exp by neighborhood
exp_tot=stats['exp'].groupby(level='aneigh').sum()
#Join with GDF
nbhd['exp_tot']=exp_tot.ix[nbhd.index]
#Plot investment over the 2001-2011 time period
plt.rcParams['figure.figsize']=15,15
fig=plt.figure()
ax = fig.add_subplot(111)
ax.patch.set_facecolor('white')
nbhd.fillna(0).plot(column='exp_tot', scheme='Quantiles', k=9, colormap='Greens',axes=ax)
ax.set_title('Distribution of Total Expenditure by Neighborhood (2001-2011)')
ax.set_aspect(1)
ax.set_axis_off()
plt.savefig(img_dir+'TransInvest_nbhd.png')
Now that we have an idea of the dynamics of value levels, it would be interesting to explore the growth trajectory of these neighborhoods. Specifically, we want to know the following:
The latter seeks to understand shifts in growth profiles. For our purposes, 'era' refers to one of three collections of years: 2001-2004, 2005-2007, and 2008-2011.
Obviously these 'eras' are not easily separated. The idea is just to provide some allowance for differing growth regimes. To construct them, we will fold the stats DF into the three periods (aggregating via mean), and merge them in with the nbhd GeoDataFrame.
#Create copy of stats, but capturing percentage change (computed within each
#neighborhood so growth rates do not bleed across neighborhood boundaries)
stats_era=stats.groupby(level='aneigh').pct_change().reset_index()
#Create mapping dictionary for eras
era_map_dict=dict({2001:0,
2002:0,
2003:0,
2004:0,
2005:1,
2006:1,
2007:1,
2008:2,
2009:2,
2010:2,
2011:2})
#Recode years to eras
stats_era['era']=stats_era['year'].map(era_map_dict)
#Groupby neighborhood and era, aggregating by mean
s_era=stats_era.groupby([stats_era['aneigh'],stats_era['era']]).mean()
#Create container for new GDFs
era_gdf_list=[]
#Create container for era labels
era_list=[]
#For each era (unique values, to avoid repeating the join per index row)...
for era in s_era.index.get_level_values(level='era').unique():
    #...capture the era...
    era_list.append(era)
    #...and capture a new era-specific GDF...
    era_gdf_list.append(gp.GeoDataFrame(nbhd.join(s_era.xs(era,level='era'))))
#Capture new era-specific GDFs in a dictionary
era_dict=dict(zip(era_list,era_gdf_list))
era_dict[1].head()
It would also be useful to get some sense of the economic potential of these neighborhoods. We can do this by assigning each neighborhood a quintile ranking pegged to the 2001 mean_iit value, via a dictionary map from the neighborhood values in the index. (We could, of course, choose a different value for this filter if desired.) Note that we will want to color by quintile to add an informational dimension to the plots. We will, therefore, include the set of colors used directly in our newly augmented GDFs.
UPDATE (2/27/14): In an effort to generate consistent color schemes for the same process, two color parameters are added (one for AGI and one for assessment value). The colors vary linearly with the value of average AGI in 2001, but the data are extreme-coded at the 5th and 95th percentiles.
#Subset to 2001
stat01=stats.xs(2001,level='year')
#Capture quintile groupings
stat01['q_econ']=pd.qcut(stats.xs(2001,level='year')['mean_iit'],5)
#Generate dictionary mapping quintile groupings to clearer labels
#(note: these string keys must match the interval labels produced by pd.qcut)
q_dict=dict({'[4376.277, 24504.371]':'q1',
             '(24504.371, 35631.989]':'q2',
             '(35631.989, 61488.399]':'q3',
             '(61488.399, 109608.369]':'q4',
             '(109608.369, 916785.75]':'q5'})
#Generate a dictionary mapping colors to quintiles
qcolor_dict=dict({'q1':'#EDF8E9',
'q2':'#BAE4B3',
'q3':'#74C476',
'q4':'#31A354',
'q5':'#006D2C'})
#Assign new quintile labels
stat01['q_start']=stat01['q_econ'].map(q_dict)
#Assign colors
stat01['colors']=stat01['q_start'].map(qcolor_dict)
#Truncate 2001 average AGI (for color display)
##Set bounds
ceiling=scipy.stats.scoreatpercentile(stat01['mean_iit'],95)
floor=scipy.stats.scoreatpercentile(stat01['mean_iit'],5)
##Create container for new values
inc_list=[]
##Recode
for iit in stat01['mean_iit']:
    inc=None
    if iit > ceiling:
        inc=ceiling
    elif iit < floor:
        inc=floor
    else:
        inc=iit
    inc_list.append(inc)
#Generate 2001 version of mean_iit
stat01['mean_iit2001']=Series(inc_list,index=stat01.index)
#Generate new color maps, specific to AGI and assessment value
agi_col_map=color.color_mapper((floor,ceiling),cmap='Blues',start=.2)
aval_col_map=color.color_mapper((floor,ceiling),cmap='Reds',start=.2)
#Generate colors based upon truncated mean_iit
stat01['agi_col']=stat01['mean_iit2001'].apply(lambda x: agi_col_map(x))
stat01['aval_col']=stat01['mean_iit2001'].apply(lambda x: aval_col_map(x))
#Define new function to extreme code indicators of interest
def extreme(x):
    '''Caps percentage change on the [-2,2] interval (missing values become 0)'''
    #Clip values below -2 up to -2 and values above 2 down to 2;
    #NaNs are recoded to 0, matching the original catch-all branch
    return list(x.fillna(0).clip(lower=-2,upper=2))
#Create new container for augmented era GDFs
era_list2=[]
#Define desired columns
era_cols=['q_start','colors','agi_col','aval_col']
#For each era in era_dict (sorted keys guarantee the era order assumed below)...
for era in sorted(era_dict):
    df=era_dict[era]
    #...add in q_start and the color columns...
    df_new=pd.merge(df,stat01[era_cols],how='left',left_index=True,right_index=True)
    #...add in extreme coded values...
    df_new['mean_iit_ec']=extreme(df_new['mean_iit'])
    df_new['50%_iit_ec']=extreme(df_new['50%_iit'])
    df_new['mean_prop_ec']=extreme(df_new['mean_prop'])
    df_new['50%_prop_ec']=extreme(df_new['50%_prop'])
    #...and throw it in the list
    era_list2.append(df_new)
#Create new dict
era_dict2=dict(zip([1,2,3],era_list2))
era_dict2[1].head().T
First of all, what were the growth rankings in each of the eras? We can look at each era in turn, comparing adjusted gross income with assessment value.
def growth_compare(i,mean_med):
    '''Plots AGI and assessment value growth bars for era i, using 'mean' or '50%' stats'''
    #Generate plot space
    fig,axes=plt.subplots(1,2,sharey=True)
    #Create new GDF that excludes empty colors
    df_temp=era_dict2[i][era_dict2[i]['colors'].notnull()].sort(columns='mean_iit')
    #Plot AGI and assessment value
    df_temp[mean_med+'_iit_ec'].plot(kind='barh',ax=axes[0],color=list(df_temp['agi_col'].values),title='AGI Growth')
    axes[0].set_ylabel('Assessment Neighborhood')
    df_temp[mean_med+'_prop_ec'].plot(kind='barh',ax=axes[1],color=list(df_temp['aval_col'].values),title='Assessment Value Growth')
    axes[1].set_ylabel('')
growth_compare(1,'mean')
plt.suptitle('Average Annual Growth During Era #1 (2001-2004)',fontsize=20)
plt.savefig(img_dir+'AvgGrowthBar1_nbhd.png')
growth_compare(1,'50%')
plt.suptitle('Median Annual Growth During Era #1 (2001-2004)',fontsize=20)
plt.savefig(img_dir+'MedianGrowthBar1_nbhd.png')
Interesting. Early on there appeared to be little correlation between the growth rates in AGI and the growth rates in assessment value. It does appear, however, that there is some correlation between starting mean AGI and growth rate. Darker colors indicate higher quintiles.
Similar stories seem to arise in the median data as well. There are some curious deviations, however. Neighborhoods like Cleveland Park, Georgetown, and Foggy Bottom experienced marginal growth in median AGI despite material increases in mean AGI. In contrast, Washington Navy Yard and DC Village saw dramatic increases in median income without corresponding increases in mean income. Clearly the shape of the income profile has changed dramatically in some neighborhoods. Assessment values, however, appear to be characterized by more even growth.
How does the picture look in the second era?
growth_compare(2,'mean')
plt.suptitle('Average Annual Growth During Era #2 (2005-2007)',fontsize=20)
plt.savefig(img_dir+'AvgGrowthBar2_nbhd.png')
growth_compare(2,'50%')
plt.suptitle('Median Annual Growth During Era #2 (2005-2007)',fontsize=20)
plt.savefig(img_dir+'MedianGrowthBar2_nbhd.png')
The second era is characterized by a substantial decline in variation, for both AGI and assessed value. Furthermore, of those experiencing positive growth in AGI, the magnitude of said growth declined. Assessed values, on the other hand, experienced positive growth for virtually all neighborhoods. The correlation between AGI growth and assessed value growth still appears to be weak at best.
Comparing the mean and median plots reveals distributions that, by and large, became more skewed towards higher incomes and assessment values during this period (as evidenced by mean growth outpacing median growth in most neighborhoods). This is perhaps unsurprising, fitting a national narrative about capital wealth appreciation in the lead up to the Recession.
What about the Great Recession and recovery era?
growth_compare(3,'mean')
plt.suptitle('Average Annual Growth During Era #3 (2008-2011)',fontsize=20)
plt.savefig(img_dir+'AvgGrowthBar3_nbhd.png')
growth_compare(3,'50%')
plt.suptitle('Median Annual Growth During Era #3 (2008-2011)',fontsize=20)
plt.savefig(img_dir+'MedianGrowthBar3_nbhd.png')
Growth all but vanishes in this time period, and the correlation between higher starting AGI and higher growth seems to have dissipated as well. Furthermore, aside from a sharp downward skew in Massachusetts Avenue Heights, the story is consistent across mean and median data.
Mapping these data would aid in identifying clustered growth activity. We can map both AGI and assessed value by era.
#Set plot dimensions
plt.rcParams['figure.figsize']=18,24
#Generate plotting object
fig,axes=plt.subplots(len(era_dict2),2)
#For each era...
for i in era_dict2:
    #...plot the distribution of mean_iit growth...
    gp.GeoDataFrame(era_dict2[i]).plot(column='mean_iit', scheme='QUANTILES', k=5, colormap='Blues',axes=axes[i-1,0])
    axes[i-1,0].set_title('Average Growth in Average AGI - Era #'+str(i))
    axes[i-1,0].patch.set_facecolor('white')
    axes[i-1,0].set_axis_off()
    axes[i-1,0].set_aspect(1)
    #...and the distribution of mean_prop growth
    gp.GeoDataFrame(era_dict2[i]).plot(column='mean_prop', scheme='QUANTILES', k=5, colormap='Reds',axes=axes[i-1,1])
    axes[i-1,1].set_title('Average Growth in Average Assessment Value - Era #'+str(i))
    axes[i-1,1].patch.set_facecolor('white')
    axes[i-1,1].set_axis_off()
    axes[i-1,1].set_aspect(1)
plt.tight_layout()
plt.savefig(img_dir+'AvgGrowthMap_era_nbhd.png')
Note that the colors here do not correspond to the same groups defined above (by mean AGI quintiles in 2001). Rather, they are quintile ranges measured with the data from each era's GDF. This is the primary reason for the shift in color.
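The point can be made concrete with a toy example (the growth figures and the `quintile_bin` helper below are invented for illustration, not taken from the era data): the same growth rate can land in different quintile bins when break points are recomputed on each era's sample.

```python
import numpy as np

#Hypothetical growth samples for two eras; era_b is uniformly higher
era_a = np.array([0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.50])
era_b = era_a * 4

def quintile_bin(value, sample):
    '''Returns the 0-indexed quintile bin for value under sample's break points'''
    breaks = np.percentile(sample, [20, 40, 60, 80])
    return int(np.searchsorted(breaks, value, side='right'))

#The same 10% growth rate falls in different bins under each era's breaks
print(quintile_bin(0.10, era_a))  # 2
print(quintile_bin(0.10, era_b))  # 0
```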
Relative income growth continues to shift from dense concentrations in the affluent northwest to pockets located in the center of the District. The eastward shift in property growth is far more dramatic. The highest growth values occur largely in the residential districts to the east of Brookland in later years.
One might wonder whether or not the median data corroborate this story.
#Generate plotting object
fig,axes=plt.subplots(len(era_dict2),2)
#For each era...
for i in era_dict2:
    #...plot the distribution of median AGI growth...
    gp.GeoDataFrame(era_dict2[i]).plot(column='50%_iit', scheme='QUANTILES', k=5, colormap='Blues',axes=axes[i-1,0])
    axes[i-1,0].set_title('Average Growth in Median AGI - Era #'+str(i))
    axes[i-1,0].patch.set_facecolor('white')
    axes[i-1,0].set_axis_off()
    axes[i-1,0].set_aspect(1)
    #...and the distribution of median assessment value growth
    gp.GeoDataFrame(era_dict2[i]).plot(column='50%_prop', scheme='QUANTILES', k=5, colormap='Reds',axes=axes[i-1,1])
    axes[i-1,1].set_title('Average Growth in Median Assessment Value - Era #'+str(i))
    axes[i-1,1].patch.set_facecolor('white')
    axes[i-1,1].set_axis_off()
    axes[i-1,1].set_aspect(1)
plt.tight_layout()
plt.savefig(img_dir+'MedianGrowthMap_era_nbhd.png')
Indeed they do, but the eastward trend appears more pronounced on the income side, while the median assessment value growth activity has a distinctly southern flavor.
In many ways, the above analysis provides empirical support to a narrative familiar to ORA staff. The income distribution in DC is shifting spatially. New real estate opportunities are being exploited by young professionals in historically less developed neighborhoods, which disproportionately reside on the eastern side of the District. We are seeing income and property value movement that is largely consistent with a shift in the locus of economic activity.
While this is generally interesting, the motivation for this exploration is the analysis of the impact of transformative investment. It would thus be useful to explore the correlation between public investment and growth in our indicators. To be specific, we need to be able to directly compare levels of public investment and growth in AGI and assessment value. We already have the growth values, but we lost our levels during the procedure used to calculate the growth values. Consequently, we need to calculate average investment for each neighborhood in each era. To preserve the ability to compare past investment to future growth, we will capture all of the era investment averages in one DF and then merge them with the growth indicators from each of the three eras.
#Create copy of stats, but capturing levels this time
inv_era=stats.reset_index().copy()
#Recode years to eras
inv_era['era']=inv_era['year'].map(era_map_dict)
#Group investment by neighborhood and era, aggregating by mean
i_era=inv_era['exp'][inv_era['era'].notnull()].groupby([inv_era['aneigh'],inv_era['era']]).mean().fillna(0)
#Group business licenses by neighborhood and era, aggregating by sum
bbl_era=inv_era['num_bbl'][inv_era['era'].notnull()].groupby([inv_era['aneigh'],inv_era['era']]).sum().fillna(0)
#Combine investment and BBL data (the grouped Series keep their original names,
#so rename 'exp' and 'num_bbl' to era-specific labels)
inv_bbl_era=DataFrame(i_era).rename(columns={'exp':'inv_era'}).join(DataFrame(bbl_era).rename(columns={'num_bbl':'bbl_era'}))
#Unstack to hold all eras with a single neighborhood reference
ib_era=inv_bbl_era.unstack(level='era')
#Rename columns
ib_era.columns=['inv1','inv2','inv3','bbl1','bbl2','bbl3']
ib_era.head(20)
With the investment and BBL information compiled by neighborhood and era, we can merge it back in with the existing data.
#Create container for new GDFs
era_gdf_list2=[]
#Create container for era labels
era_list2=[]
#For each era...
for era in era_dict2.keys():
    #...capture the era...
    era_list2.append(era)
    #...and capture a new era-specific GDF...
    era_gdf_list2.append(gp.GeoDataFrame(era_dict2[era].join(ib_era)))
#Capture new era-specific GDFs in a dictionary
era_dict3=dict(zip(era_list2,era_gdf_list2))
era_dict3[3].head().T
We can now compare investment in the three eras. Colorized correlation plots may be helpful.
Note that while these plots are the primary focus, there are several more detailed scatter plots and correlation matrices in the Appendix. For each era, we will evaluate the correlation between our responses (AGI and assessment value growth) and investment that occurs in the current and prior eras. Thus, for the first era responses, scatters will evaluate only first era investment. For the second era responses, scatters evaluate both the second and first era investments. In similar fashion, the third era responses are plotted against all three.
Getting back to the plots that immediately follow: unfortunately, Python sometimes makes very simple things difficult. In this case, while plotting the correlation matrix as a pseudocolor image is relatively straightforward, centering diverging colors on zero is not. Why do we want to center on zero? It aids in interpretation if one color is always positive and one is always negative.
To make this happen in Python, we always have the option of explicitly defining color ramps for our particular data set. A more robust option is to define a new subclass that inherits from matplotlib.colors.Normalize. I guess I had to dive into custom classes at some point...
from matplotlib.colors import Normalize
class MidpointNormalize(Normalize):
    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        # I'm ignoring masked values and all kinds of edge cases to make a
        # simple example...
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))
This subclass option is the only method that allows for continuous color mapping, something we are more often interested in than not.
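As a quick sanity check, the subclass (reproduced here so the cell stands alone) maps the chosen midpoint to the center of the colormap even when the limits are asymmetric:

```python
import numpy as np
from matplotlib.colors import Normalize

class MidpointNormalize(Normalize):
    '''Same subclass as above, reproduced so this cell runs on its own'''
    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))

#Asymmetric limits, but zero still lands at the colormap's center
norm = MidpointNormalize(vmin=-1.0, vmax=3.0, midpoint=0.0)
print(float(norm(0.0)))   # 0.5
print(float(norm(-1.0)))  # 0.0
print(float(norm(3.0)))   # 1.0
```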
#Define desired columns
cor_cols1=['mean_iit','mean_prop','50%_iit','50%_prop','inv1']
#Define first era subset
e1_sub=era_dict3[1][cor_cols1].dropna()
#Generate correlation matrix (transposing captures order provided in the DF)
e1_corr=np.corrcoef(e1_sub.T)
#Set plot size
plt.rcParams['figure.figsize']=13,10
#Construct subplot
fig,ax=plt.subplots()
#Define midpoint object
norm=MidpointNormalize(midpoint=0)
#Generate plot object
e1c=ax.imshow(e1_corr,norm=norm,cmap='RdBu',interpolation='none')
#Add color bar
fig.colorbar(e1c)
#Set tick positions
labelPos=np.arange(0,len(e1_sub.columns))
#Set tick labels
pylab.xticks(labelPos,e1_sub.columns)
pylab.yticks(labelPos,e1_sub.columns)
pylab.title('Correlation Plot in Era #1 (2001-2004)');
#Define desired columns
cor_cols2=['mean_iit','mean_prop','50%_iit','50%_prop','inv1','inv2']
#Define second era subset (recoding infinite growth rates to 0)
e2_sub=era_dict3[2][cor_cols2].dropna().replace(np.inf,0)
#Generate correlation matrix (transposing captures order provided in the DF)
e2_corr=np.corrcoef(e2_sub.T)
#Set plot size
plt.rcParams['figure.figsize']=13,10
#Construct subplot
fig,ax=plt.subplots()
#Generate plot object
e2c=ax.imshow(e2_corr,norm=norm,cmap='RdBu',interpolation='none')
#Add color bar
fig.colorbar(e2c)
#Set tick positions
labelPos2=np.arange(0,len(e2_sub.columns))
#Set tick labels
pylab.xticks(labelPos2,e2_sub.columns)
pylab.yticks(labelPos2,e2_sub.columns)
pylab.title('Correlation Plot in Era #2 (2005-2007)')
#Define desired columns
cor_cols3=['mean_iit','mean_prop','50%_iit','50%_prop','inv1','inv2','inv3']
#Define third era subset
e3_sub=era_dict3[3][cor_cols3].dropna()
#Generate correlation matrix (transposing captures order provided in the DF)
e3_corr=np.corrcoef(e3_sub.T)
#Set plot size
plt.rcParams['figure.figsize']=13,10
#Construct subplot
fig,ax=plt.subplots()
#Generate plot object
e3c=ax.imshow(e3_corr,norm=norm,cmap='RdBu',interpolation='none')
#Add color bar
fig.colorbar(e3c)
#Set tick positions
labelPos3=np.arange(0,len(e3_sub.columns))
#Set tick labels
pylab.xticks(labelPos3,e3_sub.columns)
pylab.yticks(labelPos3,e3_sub.columns)
pylab.title('Correlation Plot in Era #3 (2008-2011)')
Comparison of the correlation plots over the three periods reveals remarkably different behavior across time periods. It suggests that the association between the variables presented here is conditional on outside factors. This is most noticeable in the third era, with the general level of positive correlation increasing markedly. There are also a few particular items of note:
At this point, we can look at average growth rates in investment neighborhoods relative to the other neighborhoods in each era. For our purposes, we will view the existence of investment as a binary treatment. Any non-zero investment level neighborhoods go into the treatment group, while all remaining neighborhoods constitute the control. The first step is parsing out these groups for each era.
#Establish treatment groups by era
e1_treat=era_dict3[1]['inv1'][era_dict3[1]['inv1']>0].index
e2_treat=era_dict3[2]['inv2'][era_dict3[2]['inv2']>0].index
e3_treat=era_dict3[3]['inv3'][era_dict3[3]['inv3']>0].index
#Establish control groups by era
e1_control=era_dict3[1]['inv1'][era_dict3[1]['inv1']==0].index
e2_control=era_dict3[2]['inv2'][era_dict3[2]['inv2']==0].index
e3_control=era_dict3[3]['inv3'][era_dict3[3]['inv3']==0].index
For each of these groups, we need to compare the average growth rates within each era.
#Capture desired columns
comp_cols=['mean_iit','mean_prop','50%_iit','50%_prop']
#Capture average growth rates for treatment and control groups in a single DF
e1_comp=DataFrame(era_dict3[1].ix[e1_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
join(DataFrame(era_dict3[1].ix[e1_control][comp_cols].mean())).rename(columns={0:'control'})
e2_comp=DataFrame(era_dict3[2].ix[e2_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
join(DataFrame(era_dict3[2].ix[e2_control][comp_cols].mean())).rename(columns={0:'control'})
e3_comp=DataFrame(era_dict3[3].ix[e3_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
join(DataFrame(era_dict3[3].ix[e3_control][comp_cols].mean())).rename(columns={0:'control'})
#Plot comparisons for each era
plt.rcParams['figure.figsize']=15,20
fig,axes=plt.subplots(3,sharex=True)
e1_comp.plot(kind='bar',ax=axes[0],title='Growth Rate Comparisons - First Era',color=['#ef8a62','#999999'])
axes[0].patch.set_facecolor('white')
e2_comp.plot(kind='bar',ax=axes[1],title='Growth Rate Comparisons - Second Era',color=['#ef8a62','#999999'])
axes[1].patch.set_facecolor('white')
e3_comp.plot(kind='bar',ax=axes[2],title='Growth Rate Comparisons - Third Era',color=['#ef8a62','#999999'])
axes[2].patch.set_facecolor('white')
axes[2].set_xticklabels(['Average AGI','Average Assessment Value','Median AGI','Median Assessment Value'])
#Set rotation of tick labels
plt.xticks(rotation=0)
#Fix spacing
plt.tight_layout()
plt.savefig(img_dir+'GrwthCompare.png')
Wow, that's interesting. For almost every comparison, the control group grew faster. This smacks of selection bias in the identification of projects. That guy is fighting with spatial resolution to become public enemy #1.
It would be useful to capture the same plots for lagged investment scenarios:
These plots should allow us to see distancing that may occur after the investment has had a few years to take effect.
#Capture average growth rates for treatment and control groups in a single DF
e2L1_comp=DataFrame(era_dict3[2].ix[e1_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
join(DataFrame(era_dict3[2].ix[e1_control][comp_cols].mean())).rename(columns={0:'control'})
e3L1_comp=DataFrame(era_dict3[3].ix[e1_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
join(DataFrame(era_dict3[3].ix[e1_control][comp_cols].mean())).rename(columns={0:'control'})
e3L2_comp=DataFrame(era_dict3[3].ix[e2_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
join(DataFrame(era_dict3[3].ix[e2_control][comp_cols].mean())).rename(columns={0:'control'})
#Plot comparisons for each era
plt.rcParams['figure.figsize']=15,20
fig,axes=plt.subplots(3,sharex=True)
e2L1_comp.plot(kind='bar',ax=axes[0],title='Lagged Growth Rate Comparisons - First Era Groups & Second Era Data',\
color=['#ef8a62','#999999'])
axes[0].patch.set_facecolor('white')
e3L1_comp.plot(kind='bar',ax=axes[1],title='Lagged Growth Rate Comparisons - First Era Groups & Third Era Data',\
color=['#ef8a62','#999999'])
axes[1].patch.set_facecolor('white')
e3L2_comp.plot(kind='bar',ax=axes[2],title='Lagged Growth Rate Comparisons - Second Era Groups & Third Era Data',\
color=['#ef8a62','#999999'])
axes[2].patch.set_facecolor('white')
axes[2].set_xticklabels(['Average AGI','Average Assessment Value','Median AGI','Median Assessment Value'])
#Set rotation of tick labels
plt.xticks(rotation=0)
#Fix spacing
plt.tight_layout()
plt.savefig(img_dir+'LagGrwthCompare.png')
It appears that in the second era (2005-2007), the treatment group neighborhoods did make some progress relative to the control groups. This was, however, quickly and substantially reversed in the third time period.
Well, this supports the notion that early detection of tax base response to investment is a less than straightforward exercise. Let's take a look at some scatters.
Here is the first era (2001-2004).
#Define columns to insert into the matrix
cor_cols=['mean_iit','mean_prop','50%_iit','50%_prop','inv1','inv2','inv3']
scatter_matrix(era_dict3[1][cor_cols],alpha=.4,figsize=(20,20),diagonal='kde');
The second era (2005-2007)...
scatter_matrix(era_dict3[2][cor_cols].dropna().replace(np.inf,0),alpha=.4,figsize=(20,20),diagonal='kde');
...and the third era (2008-2011)
scatter_matrix(era_dict3[3][cor_cols],alpha=.8,figsize=(20,20),diagonal='kde');